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Abstract 

General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for nu¬ 
merous applications such as algebraic multigrid method (AMG), breadth first search and shortest 
path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM imple¬ 
mentation has to handle extra irregularity from three aspects: (1) the number of nonzero entries 
in the resulting sparse matrix is unknown in advance, (2) very expensive parallel insert opera¬ 
tions at random positions in the resulting sparse matrix dominate the execution time, and (3) load 
balancing must account for sparse data in both input matrices. 

In this work we propose a framework for SpGEMM on GPUs and emerging CPU-GPU 
heterogeneous processors. This framework particularly focuses on the above three problems. 
Memory pre-allocation for the resulting matrix is organized by a hybrid method that saves a 
large amount of global memory space and efficiently utilizes the very limited on-chip scratch¬ 
pad memory. Parallel insert operations of the nonzero entries are implemented through the GPU 
merge path algorithm that is experimentally found to be the fastest GPU merge approach. Load 
balancing builds on the number of necessary arithmetic operations on the nonzero entries and is 
guaranteed in all stages. 

Compared with the state-of-the-art CPU and GPU SpGEMM methods, our approach delivers 
excellent absolute performance and relative speedups on various benchmarks multiplying matri¬ 
ces with diverse sparsity structures. Eurthermore, on heterogeneous processors, our SpGEMM 
approach achieves higher throughput by using re-allocatable shared virtual memory. 

Keywords: 

sparse matrix, sparse matrix-matrix multiplication, linear algebra, GPU, heterogeneous 
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1. Introduction 

General matrix-matrix multiplication (GEMM) is one of the most crucial operations in com¬ 
putational science and modeling. The operation multiplies a matrix A of size mxk with a matrix 
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B of size kxn and gives a resulting matrix C of size m x n. In many linear solvers and graph 
problems such as algebraic multigrid method (AMG) 10, breadth first search |0, finding short- 


problems sucn as algebraic multigrid metnod (AMUj breadth hrst search lUI, nnaing snort- 
est path 101, colored ihtersectioh 10 ahd sub-graphs 10]^ it is required to exploit sparsity of the 
two iuput matrices aud the resultihg matrix because their deuse forms hormally heed huge stor¬ 
age space aud computatioh cost for the zero eutries. Therefore geueral sparse matrix-matrix 
multiplicatioh (SpGEMM) becomes a commou buildihg block ih these applicatiohs. 

Compared to CPUs, moderu graphics processihg uhits (GPUs) promise much higher peak 
floatihg-poiht performahce aud memory baudwidth. Thus a lot of research has cohceutrated 
Oh GPU accelerated sparse matrix-dehse vector multiplicatioh 101810 ahd sparse matrix-dehse 
matrix multiplicatioh iGsini] ahd achieved relatively attractive performahce. However, despite 
the prior achievemehts oh these GPU sparse BLAS routihes, massive parallelism ih GPUs is 
still sighificahtly uhderused for the SpGEMM algorithm, because it has to haudle three more 
challeugihg problems: (1) the humber of houzero eutries ih the resultihg matrix is uhkhowh ih 
advahce, (2) very expeusive parallel ihsert operatiohs at raudom positiohs ih the resultihg matrix 
domihate the executioh time, aud (3) load balaucihg must accouht for sparse data ih both iuput 
matrices with diverse sparsity structures. 

Previous GPU SpGEMM methods 10 d [B Q E [13] have proposed a few solutious 
for the above problems aud demoustrated relatively good time aud space complexity. However, 
the experimeutal results showed that they either ouly work best for fairly regular sparse matri¬ 


ces mi 1131 
structures 


16l], or brihg extra high memory overhead for mah'ices with some specific sparsity 
i 4I15[] . Moreover, iu the usual seuse, uoue of these methods cau coustautly out¬ 


perform well optimized SpGEMM approach id for multicore CPUs. 

Our work described iu this paper particularly focuses ou improviug GPU SpGEMM per¬ 
formahce for matrices with arbitrary irregular sparsity structures by proposiug more efficieut 
methods to solve the above three problems ou GPUs aud emergiug CPU-GPU heterogeueous 
processors. 

Ih this paper, we make the followiug coutributious: 


• A framework for fast SpGEMM. We desigu a 4-stage framework for implemeutiug SpGEMM 
Oh mauycore platforms iucludiug homogeueous GPUs aud heterogeueous processors com¬ 
posed of CPU cores, GPU cores aud shared virtual memory. This framework effectively 
orgauizes memory allocatiou, load balaucihg aud GPU keruel lauuches. 

• A hybrid method for the resulting matrix pre-allocation. We preseut a hybrid method 
that ihitially allocates memory of upper bouud size for short rows aud progressively allo¬ 
cates memory for loug rows. The experimeutal results show that our method saves a large 
amouht of global memory space aud efficieutly utilizes the very limited ou-chip scratchpad 
memory. 

• Parallel insert operations through fast merging. We propose au efficieut parallel iusert 
method for loug rows of the resultihg matrix by usiug the fastest merge algorithm available 
Oh GPUs. We make au experimeutal evaluatiou aud choose GPU merge path algorithm 
from five caudidate GPU merge approaches. 

• Heuristic-based load balancing. We develop a load balaucihg orieuted heuristic method 
that assighs rows of the resultihg matrix to multiple bius with differeut subsequeut compu- 
tatioual methods. Our approach guarautees load balaucihg iu all calculatiou steps. 
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Our framework and corresponding algoritlims delivers excellent performance in two exper¬ 
imental scenarios: (1) calculating triple matrix Galerkin products (i.e., P^AP) in AMG for 2D 
and 3D Poisson problems, and (2) computing matrix squaring (i.e., A^) on a benchmark suite 
composed of 23 sparse matrices with diverse sparsity structures. 

In the context of Galerkin products, our method constantly outperforms the state-of-the- 
art GPU SpGEMM methods in two vendor supplied libraries cuSPARSE and CUSP. Average 
speedups of 1.9x (up to 2.6x) and 1.7x (up to 2.7x) are achieved when compared to cuSPARSE 
and CUSP, respectively. 

In the context of matrix squaring, more comparison methods are included. First, on two 
nVidia GPUs (i.e., a GeEorce GTX Titan Black and a GeEorce GTX 980), compared with cuS¬ 
PARSE and CUSP, our approach delivers on average 3.lx (up to 9.5x) and 4.6x (up to 9.9x) 
speedups, respectively. Second, compared to a recently developed CUDA-specific SpGEMM 
method RMerge IB, our method offers on average 2.5x (up to 4.9x) speedup on the nVidia 
GeEorce GTX 980 GPU. Third, compared to the SpGEMM method in the latest Intel Math Ker¬ 
nel Library (MKL) on a six-core Xeon E5-2630 CPU and quad-channel system memory, our 
method gives on average 2.4x (up to 5.2x) and 2.lx (up to 4.2x) speedups on the nVidia GeEorce 
GTX 980 GPU and an AMD Radeon R9 290X GPU, respectively. 

Eurthermore, our approach can utilize re-allocatable memory controlled by CPU-GPU het¬ 
erogeneous processors. On an AMD A10-7850K heterogeneous processor, compared to merely 
using its GPU cores, our framework delivers on average 1.2x (up to 1.8x) speedup while utilizing 
re-allocatable shared virtual memory in the system. 

2. Preliminaries 

2.7. SpGEMM Overview 

Eor the sake of generality, the SpGEMM algorithm description starts from discussion of the 
GEMM and gradually takes sparsity of the matrices A, B and C into consideration. Eor the matrix 
A, we write Ojj to denote the entry in the /th row and the y'th column of A and a,, to denote the 
vector consisting of the /th row of A. Similarly, the notation denotes the y'th column of A. In 
the GEMM, the /th row of the resulting matrix C can be defined by 

c/h! — (u:/* ■ b„,\, ci/H! ■ • - ■, ' b:^p), 

where the operation ■ is the dot product of the two vectors. 

We first give consideration to the sparsity of the matrix A. Without loss of generality, we 
assume that the /th row of A only consists of two nonzero entries in the kth and the /th column, 
respectively. Thus ai, becomes (a/j, a,;). Since all other entries are zeros, we do not record them 
explicitly and ignore their influence on the dot products in the calculation of the /th row of C. 
Then we obtain 

Cit = {aikbk\ + aiibn,aikbk2 + auba, ■■■, Oikb^p + aubip). 

We can see in this case, only entries in the kth and the /th row of B have contribution to the 
/th row of C. Then row vector form instead of column vector form is used for the matrix B. So 
we obtain 

C/h! — OikbkiF -t Oiibi,^. 

Since the matrix B is sparse as well, again without loss of generality, we assume that the 
kth row of B has only two nonzero entries in the rth and the fth column, and the /th row of B 
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also has only two nonzero entries in the ith and the fth column. So the two rows are given by 
bk* = {bkr, bkr) and bi^ = {b^, bu). Then 


C;. = Qikibkr, bkt) + auibis, bi,). 

Because the matrix C is also sparse and the /th row of C only has three nonzero entries in the 
rth, the ith and the fth column, the row can be given by 


where a,- - Gikbkr, cis - aubis and a, - Uikbut + aubi,. 

In general there are more nonzero entries per rows of the matrices A, B and C. But from the 
above derivation we can see that the SpGEMM can be represented by operations on row vectors 
of the matrices. Therefore, in this work we store all sparse matrices in compressed sparse row 
(CSR) format. The CSR format of a matrix consists of three separate arrays: (1) row pointer 
array of size n + \, where n is the number of rows of the matrix, (2) column index array of size 
nnz, where nnz is the number of nonzero entries of the matrix, and (3) value array of size nnz. 
Hence the overall space complexity of the CSR format is 0{n + nnz). Actually compressed sparse 
column (CSC) format is also widely used for sparse matrices stored in column-major order ifisl] . 
The SpGEMM in the CSC format is almost the same as in the CSR format except rows are 
changed to columns and vice versa. 

The above CSR-based SpGEMM algorithm can be performe d by pseudocode in Algorithm[T] 
An early description of this algorithm was given by Gustavson 1191]. 


Algorithm 1 Pseudocode for the SpGEMM. 

1: for each in the matrix A do 

2: set c,» to 0 

3: for each nonzero entry aij in a,-, do 

4: load bjt 

5: for each nonzero entry bjk in bj^ do 

6: value «— aijbjk 

7: if Cik t Ci, then 

8: insert Cik to c,* 

9: Cik ^ value 

10 : else 

11: Cik Cik + value 

12 : end if 

13: end for 

14: end for 

15: end for 


2.2. Prior SpGEMM Algorithms 

A classic CPU SpGEMM algorithm, also known as Matlab algorithm, was proposed by 
Gilbert et al. lIT^ . This approach uses a dense vector-based sparse accumulator (or SPA) and 
takes Oi flops + nnz(B) + n) time to complete the SpGEMM, where flops is defined as the num¬ 
ber of necessary arithmetic operations on the nonzero entries, nnz{B) is defined as the number 
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of nonzero entries in the matrix B, and n is the number of rows/columns of the input square 
matrices. Matam et al. 1201] developed a similar Matlab algorithm implementation for GPUs. 
Sulatycke and Ghose Ell] proposed a cache hits-oriented algorithm runs in relatively longer time 
Oiflops + rP'). A fast serial SpGEMM algorithm with time complexity + was 

developed by Yuster and Zwick fiStl . Bulug and Gilbert ll23ll presented an SpGEMM algorithm 
with time complexity independent to the size of the input matrices under assumptions that the 
algorithm is used as a sub-routine of 2D distributed memory SpGEMM and the input matrices 
are hypersparse {nnz < n). 

Recent GPU-based SpGEMM algorithms showed better time complexity. The SpGEMM 


algorithm in the cuSPARSE library 0121113|] utilized GPU hash table for the insert operations 


(lines 7-11 in Algorithm [T]i. So time complexity of this approach is Oi flops) on average and 
Oi flops nnzr{C)) in the worst case, where nnzr(C) is defined as the average number of nonzero 
entries in the rows of the matrix C. Because the algorithm allocates one hash table of hxed size 
for each row of C, the ^ace complexity is 0{nnz(A) -n nnz{E) + n + nnz(C)). 

The CUSP library | 


14|] developed an SpGEMM method called expansion, sorting and com¬ 


pression (ESC) that expands all candidate nonzero entries generated by the necessary arithmetic 
operations (line 6 in Algorithm[T]i into an intermediate sparse matrix C, sorts the matrix by rows 
and columns and compresses it into the resulting matrix C by eliminating entries in duplicate 
positions. By using GPU radix sort algorithm (with linear time complexity while size of the 
index data type of the matrices is fixed) and prefix-sum scan algorithm (with linear time com¬ 
plexity) as building blocks, time complexity of the ESC algorithm is Oiflops + nnz(C) + nnz(C)). 
Since nnziC) equals half of flops, the ESC algorithm takes the optimal Oiflops) time. Dalton 
et al. fisll improved the ESC algorithm by executing sorting and compression on the rows of C, 
but not on the entire matrix. Therefore fast on-chip memory has a chance to be utilized more 
efficiently. The improved method sorts the very short rows (of size no more than 32) by using 
sorting network algorithm (with time complexity OinnzriC) log^(nnzr(C)))) instead of the radix 
sort algorithm which is mainly efficient for long lists. So the newer method is more efficient 
in practice, even though its time complexity is not lower than the original ESC algorithm. Be¬ 
cause both of the ESC algorithms allocate an intermediate matrix C, they have the same space 
complexity OinnziA) + nnziB) -i- nnziC) -i- nnziC)). 

RMerge algorithm, recently proposed by Gremse et al. 13], iteratively merges rows in the 
matrix B into the resulting matrix C. Because this approach underutilizes thread interaction 
and generates one intermediate sparse matrix for each iteration step, it works best for input 
matrices with evenly distributed short rows. Eor irregular input matrices, load imbalance and 
large memory allocation make this method inefficient. 

2.3. Terminology Definition for GPU Programming 

Because CUDA and OpenCL are both widely used in GPU programming and they actually 
deliver comparable performance Edil . our SpGEMM algorithm support both of them. We use 
CUDA implementation on nVidia GPUs and OpenCL implementation on AMD GPU in our 
SpGEMM evaluation. 

Eor simplicity, we dehne the following unihed terminologies: (1) thread denotes thread in 
CUDA and work item in OpenCL, (2) thread bunch denotes warp in nVidia GPU and wave- 
front in AMD GPU, (3) thread group denotes thread block or cooperative thread array (CTA) 
in CUDA and work group in OpenCL, (4) core denotes streaming multiprocessor (SMX) or 
Maxwell streaming multiprocessor (SMM) in nVidia GPU and compute unit in AMD GPU, and 
(5) scratchpad memory denotes shared memory in CUDA and local memory in OpenCL. 
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3. Performance Considerations 


3.1. Memory Pre-allocation For the Resulting Matrix 

Compared to SpGEMM, oth er sp arse matrix multiplication operations (e.g., multiplication of 
sparse matrix and dense matrix lU HI 2^1 its special case sparse matrix-vector multiplica¬ 
tion iSSSIlS 27]) pre-allocate a dense resulting matrix or vector. Thus the size of the result of 
the multiplication is trivially predictable, and the corresponding entries are stored to predictable 
memory addresses. However, because the number of nonzero entries in the resulting sparse ma¬ 
trix C is unknown in advance, precise memory allocation of the SpGEMM is impossible before 
real computation. Moreover, physical address of each new entry is unknown either (consider line 
7 in Algorithm [1] the position k is only a column index that cannot trivially map to a physical 
address on memory space). 

To solve this problem, the previous SpGEMM algorithms proposed four different solutions; 

(1) precise method, (2) probabilistic method, (3) upper bound method, and (4) progressive 
method. 

The first method, precise method, pre-computes a simplified SpGEMM in the same com¬ 
putational pattern. We can imagine that multiplication of sparse boolean matrices is more effi¬ 
cient than multiplication of sparse floating-point matrices. RMerge algorithm and the SpGEMM 
methods in cuSPARSE and MKL are representatives of this approach. Even though the pre- 
computation generates precise size of nnz{C), this method is relatively expensive since the SpGEMM 
operation in the same pattern is executed twice. 

The second method, probabilistic method, estimates an imprecise nnz(C). This group of 
approaches ||28. 3 based on random sampling and probability analysis on the input 

matrices. Since they do not guarantee a safe lower bound for the resulting matrix C and extra 
memory has to be allocated while the estimation fails, they were mostly used for estimating the 
shortest execution time of multiplication of multiple sparse matrices. 

The third method, upper bound method, computes an upper bound of the number of nonzero 
entries in the resulting matrix C and allocates corresponding memory space. Numerically, the 
upper bound size equals nnz{C), or half of flops, the number of necessary arithmetic operations. 
The ESC algorithms use this method for memory pre-allocation. Even though this approach 
saves cost of the pre-computation in the precise method, it brings another problem that the in¬ 
termediate matrix C may be too large to fit in the device global memory. Since the SpGEMM 
algorithm does not take into consideration cancellation that eliminates zero entries generated by 
arithmetic operations, the resulting matrix is normally larger than the input matrices. Table |2] 
shows that nnz(C) is much larger than nnz(C) while squaring some matrices. Eor example, the 
sparse matrix Wind Tunnel generates 626.1 million nonzero entries (or 7.5 GB memory space for 
32-bit index and 64-bit value) for the intermediate matrix C while the real product C (i.e., A^) 
only contains 32.8 million nonzero entries. Although the upper bound method can partition the 
intermediate matrix C into multiple sub-matrices, higher global memory pressure may reduce 
overall performance. 

The last method, progressive method, first allocates memory of a proper size, starts sparse 
matrix computation and re-allocates the buffer if larger space is required. Some CPU sparse 
matrix libraries use this method. Eor instance, sparse matrix computation in the Matlab iQ 
increases the buffer by a ratio of 50% if the current memory space is exhausted. 

Since the upper bound method sacrifices space efficiency for the sake of improved perfor¬ 
mance and the progressive method is good at saving space, we use a hybrid method composed of 
the both approaches. However, compared to the relatively convenient upper bound method, it is 
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hard to directly implement a progressive method for discrete GPUs. The reason is that although 
modern GPU devices have the ability of allocating device global memory while kernels are run¬ 
ning, they still cannot re-allocate device memory on the fly. We will describe our hybrid method 
designed for discrete GPUs in the next section. 

On the other hand, emerging heterogeneous processors, composed of multiple CPU cores 
and GPU cores in one chip , supply both flexibility and efficiency. AMD Accelerated Process¬ 


ing Units (APUs) 0311 l32ll . Intel multi-CPU and GPU system-on-a-chips (SoC) devices 
nVidia Echelon heterogeneous GPU architecture ll34l] . and many mobile processors (e.g., nVidia 
Tegra ll^ and Qualcomm Snapdragon 1361) are representatives of the heterogeneous processor. 
Heterogeneous system architecture (HSA) 11371] and OpenCL 2.0 OSll deliver programming tools 
for some heterogeneous processors. In this architecture, integrated GPU cores can directly use 
system memory allocated by the CPU part. Then data transfer through connection interfaces 
such as PCIe link can be avoided to obtain higher performance if^ . This gives our SpGEMM 
algorithm a chance to let integrated GPUs use re-allocatable system memory for a better overall 
performance. Later on, we will show the corresponding performance gain by using an AMD 
APU. 


3.2. Parallel Insert Operations 

As shown in Algorithm [T] for each trivial arithmetic computation (line 6), one much more 
expensive insert operation (lines 7-11) is required. To the best of our knowledge, none of the 
previous GPU SpGEMM methods takes into account that the input sequence (line 4) is ordered 
because of the CSR formafl One of our algorithm design objectives is to efficiently utilize this 
property. Based on experiments by Kim et al. S, as the SIMD units are getting wider and wider, 
merge sort methods will outperform hash table methods on the join-merge problem, which is a 
similar problem in the SpGEMM. Then our problem converts to finding a fast GPU method for 
merging sorted sequences. Later on we will describe our strategy in detail. 


3.3. Load Balancing 

Because distribution patterns of nonzero entries in both input sparse matrices can be very 
diverse (consider plots of the matrices in Table|2]i, input space-based data decomposition 112, [21 1 
normally does not bring efficient load balancing. One exception is that computing SpGEMM for 
huge sparse matrices on large scale distributed memory systems, 2D and 3D decomposition 
on input space methods demonstrated good load balancing and scalability by utilizing efficient 
communication strategies SSill 1^ . However, in this paper we mainly consider load balancing 
for fine-grained parallelism in GPU and CPU-GPU shared memory architectures. 

Therefore we use the other group of load balancing methods based on output space decom¬ 
position. Dalton et al. in presented a method that sorts rows of the intermediate matrix C, 
divides it into 3 sub-matrices that include the rows in different size ranges, and uses differenti¬ 
ated ESC methods for the sub-matrices. We have a similar consideration, but our implementation 
is completely different. We do not strictly sort rows of the intermediate matrix C but just assign 
rows to a fixed number of bins through a much faster linear time traverse on CPU. Moreover, 
we decompose the output space in a more detailed way that guarantees much more efficient 
load balancing. We will demonstrate that our method is always load balanced in all stages for 
maximizing resource utilization of GPUs. 


* Actually according to the CSR format standard, the column indices in each row do not necessarily have to be sorted. 
But most implementations choose to do so, thus our method reasonably makes this assumption. 
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4. Methodology 

4.1. Framework and Algorithm Design 

Our SpGEMM framework includes four stages: (1) calculating upper bound, (2) binning, (3) 
computing the resulting matrix, and (4) arranging data. Figure[T]plots this framework. 


Calculating 
upper bound 


Binning 


Stage 1 



Figure 1: The SpGEMM framework composed of four stages. 


The first stage, calculating upper bound, generates the upper bound number of nonzero 
entries in each row of the resulting matrix C. We create an array U of size m, where m is 
the number of rows of C, for the upper bound sizes of the rows. We use one GPU thread for 
computing each entry of the array U. Algorithm|2]describes this procedure. 


Algorithm 2 Pseudocode for the hrst stage on GPUs. 
1: for each entry m, in U in parallel do 

2 : Hi ^— 0 

3: for each nonzero entry aij in do 

4: Hi <— Hi + nnz(bjt) 

5: end for 

6 : end for 


The second stage, binning, deals with load balancing and memory pre-allocation. We first 
allocate 38 bins and put them into hve bin groups. The bins contain the indices of the entries in 
the array U and present as one array of size m with 38 segments. Then all rows are assigned to 
corresponding bins according to the number of nonzero entries. Finally, based on the sizes of the 
bins, we allocate a temporary matrix for nonzero entries in the resulting matrix C. 

The hrst bin group includes one bin that contains the indices of the rows of size 0. The 
second bin group also only has one bin that contains the indices of the rows of size 1. Because 
the rows in the hrst two bins only require trivial operations, they are excluded from subsequent 
more complex computation on GPUs. Thus a better load balancing can be expected. 

The third bin group is composed of 31 bins that contain the indices of the rows of sizes 2-32, 
respectively. Since the sizes of these rows are no more than the size of a single thread bunch 
(32 in current nVidia GPUs or 64 in current AMD GPUs) and these rows require non-trivial 
computation, using one thread bunch or one thread group for each row cannot bring efficient 
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instruction throughput on GPUs. Therefore, we use one thread for each row. Further, because 
each bin only contains the rows of the same upper bound size, the bins can be executed separately 
on GPUs with different kernel programs for efficient load balancing. In other words, 31 GPU 
kernel programs will be executed for the 31 bins, if not empty. 

The fourth bin group consists of 4 bins that contain the indices of the rows located in size 
ranges 33-64, 65-128, 129-256 and 257-512, respectively. The rows of these sizes are grouped 
because of three reasons: (1) each of them is large enough to be efficiently executed by a thread 
group, (2) each of them is small enough for scratchpad memory (48 kB per core in current nVidia 
Kepler GPUs, 96 kB per core in current nVidia Maxwell GPUs and 64 kB per core in current 
AMD Graphics Core Next, or GCN, GPUs), and (3) the final sizes of these rows in the resulting 
matrix C are predictable in a reasonable small range (no less than the lower bound of size 1 
and no more than the corresponding upper bound sizes). Even though the rows in each bin do 
not have exactly the same upper bound size, a good load balancing still can be expected because 
each row is executed by using one thread group and inter-thread group load balancing is naturally 
guaranteed by the GPU low-level scheduling sub-systems. 

The fifth bin group includes the last bin that contains the indices of the rest of the rows of 
size larger than 512. These rows have two common features: (1) their sizes can be too large 
(recall nnzr{C) in Table |2]i to fit in the scratchpad memory, and (2) predicting the final sizes of 
these rows to a small range (scratchpad memory level) is not possible in advance. Therefore, we 
execute them in a unified progressive method described later. Again because we use one thread 
group for each row, load balancing is naturally guaranteed. 

Since we do not use precise method for memory pre-allocation, a temporary memory space 
for the resulting matrix C is required. We design a hybrid method that allocates a CSR format 
sparse matrix C of the same size of the resulting matrix C as temporary matrix. We set nnz(ci*) 
to M, while the row index i is located in the bin groups 1-4 because compared with modern GPU 
global memory capacity, the total space requirement of these rows is relatively small. For the 
rows in the bin group 5, we set nnz(ch) to a fixed size 256 since normally this is an efficient 
working size for the scratchpad memory. Therefore, we can see that if all of the indices of the 
rows are in the bin groups 1^, our hybrid method converts to the upper bound method, on the 
other extreme end, our method converts to the progressive method. But generally, we obtain 
benefits from the both individual methods. The stage 2 is executed on CPU since it only requires 
a few simple linear time traverses, which are more efficient for the CPU cache sub-systems. The 
pseudocode is shown in Algorithm^ 

The third stage, computing the resulting matrix, generates nnz(cit) and the final nonzero 
entries stored in the temporary matrix C. 

For the rows in the bin groups 1-2, we simply update the numbers of corresponding nonzero 
entries. For the rows in the bin groups 3-5, we use three totally different methods: (1) heap 
method, (2) bitonic ESC method, and (3) merge method, respectively. Note that each bin has a 
counter (at the host side) that records the number of rows included. So the host can easily decide 
if a GPU kernel will be issued for a certain bin. In other words, our approach only issue kernels 
for non-empty bins. 

The heap method first creates an empty implicit index-value pair heap (or priority queue) of 
the upper bound size for each row in the bin group 3. The heaps are located in the scratchpad 
memory and collect all candidate nonzero entries for corresponding rows. Then each heap ex¬ 
ecutes a heapsort-like operation to generate an ordered sequence located in the tail part of the 
heap. The difference between this operation and the classic heapsort operation is that the entries 
in the resulting sequence are duplicate-free while the initial heap includes duplicate entries. In 
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Algorithm 3 Pseudocode for the second stage on a CPU core. 

1 

for each entry m, in t/ do 


2 

if M; = 0 then 

> The 1 St bin group 

3 

insert i to bino 


4 

nnz(cit) <— 0 


5 

else if M, = 1 then 

> The 2nd bin group 

6 

insert i to bin\ 


7 

nnz(cit) <— 1 


8 

else if Uj > 2 && m, < 32 then 

> The 3rd bin group 

9 

insert i to 


10 

nnz(cu) <— M; 


11 

else if M, > 33 && m, < 64 then 

> The 4th bin group 

12 

insert i to bin-i-i 


13 

nnz(cit) <— Ui 


14 

else if Ui > 65 && m, <128 then 

> The 4th bin group 

15 

insert i to bin 24 


16 

nnz(cit) <— Ui 


17 

else if Ui >129 && m,- < 256 then 

> The 4th bin group 

18 

insert i to bin^s 


19 

nnz(cit) <— Ui 


20 

else if Ui > 257 && m,- <512 then 

> The 4th bin group 

21 

insert i to binj,^ 


22 

nnz(ci*) Ui 


23 

else if M, > 512 then 

> The 5th bin group 

24 

insert i to bin 2 ^ 


25 

nnz(cit) <— 256 


26 

end if 


27 

end for 


28 

nnz(C) <— Yj nnz(cit) 



each delete-max step in our variant heapsort, the root node and the first entry of the resulting 
sequence are fused if they share the same index; otherwise the root node is inserted to the head 
part of the sequence. Our method is also distinguished from a heap-based sparse accumulator 
given by Gilbert et al. by the mechanism of eliminating duplicate entries. Figure|2]gives two 
steps of an example of our heap method. Finally, the sorted sequence without duplicate indices is 
generated in the scratchpad memory and saved to the matrix C in the global memory. In addition, 
the numbers of nonzero entries in the rows of the resulting matrix C are updated to the sizes of 
the corresponding resulting sequences. 

For the rows in each bin of the bin group 4, a typical ESC algorithm is used. The method first 
collects all candidate nonzero entries to an array in the scratchpad memory, then sorts the array 
by using basic bitonic sort and compresses duplicate indices in the sequence by using prefix-sum 
scan. Figure[3]shows an example of this procedure. Finally, a sorted sequence without duplicate 
indices is generated in the scratchpad memory and saved to the matrix C, and the numbers of 
nonzero entries in the rows are updated. 

For the rows in the bin group 5, our method inserts each input nonzero entry to the corre- 
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(a) 


(b) 


(c) 


Figure 2: Two steps of an example of the heap method. From (a) to (b), the root entry is fused to the first entry in 
resulting sequence since they share the same index. From (b) to (c), the root entry is inserted to the sequence since they 
have different indices. After each step, the heap property is reconstructed. 


idx=1 idx=3 idx=5 idx=2 
val=1 I val=3 I val=3 I val=6 


idx=5 

val=1 


idx=2 

val=2 I 


idx=1 idx=2 idx=2 idx=3 idx=5 idx=5 
val=1 I val=6 I val=2 I val=3 val=3 val=1 I 


after 

bitonic sort 


after 

tagging duplicates 


idx=1 

idx=2 

idx=2 

idx=3 

idx=5 

idx=5 



val=1 

val=8 

val=2 

val=3 

val=4 

val=1 




after 

fusing duplicates 


after 

prefix-sum scan 


idx=1 

idx=2 

idx=3 

idx=5 

val=1 

val=8 

val=3 

val=4 


after compression 
(final status) 


Figure 3: An example of the bitonic ESC method. 


spending row of the resulting matrix C (lines 7-11 in Algorithm[TJ in parallel. We can see that the 
input sequence (the candidate nonzero entries) and the resulting sequence (the selected nonzero 
entries in the current row of C) should always be kept ordered and duplicate-free because of the 
CSR format. Therefore, we can convert the parallel insert operations to parallel merge operations 
that merge ordered sequences and the final resulting sequence is ordered and duplicate-free. 

Each parallel merge operation can be split into multiple sub-steps: (1) a binary search op¬ 
eration on the resulting sequence for fusing entries with the same indices and tagging them, (2) 
a prefix-sum scan operation on the input sequence for getting continuous positions in the incre¬ 
mental part of the resulting sequence, (3) copying non-duplicate entries from the input sequence 
to the resulting sequence, and (4) merging the two sequences in one continuous memory space. 
Figure |4] shows an example of this procedure. After all input sequences are merged into one 
resulting sequence, it is saved to the matrix C. Then the numbers of nonzero entries in the rows 
are updated. 

As we allocate a limited scratchpad memory space for the resulting sequence, a potential 
overflow may happen. In this case, we first compare total size of the two sequences (note that 
the input sequence is in the thread registers, but not in the scratchpad memory yet) with the 
allocated size of the resulting sequence in the scratchpad memory. If a merge operation is not 
allowed, our method records current computation position as a checkpoint and dumps the result¬ 
ing sequence from the scratchpad memory to the global memory. Then the host allocates more 
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Figure 4: An example of the merge method. The input sequence is in the register file. Its mask sequence and the resulting 
sequence are in the scratchpad memory. 


global memory (we use 2x each time) and re-launches kernel with a 2x large scratchpad memory 
setting. The relaunched kernels obtain checkpoint information, and load existing results to the 
scratchpad memory and continue the computation. The global memory dumping and reloading 
bring an extra overhead, but actually it does not affect the total execution time too much because 
of three reasons: (1) the global memory access is almost completely coalesced, (2) the latency 
could be hidden by subsequent computation, and (3) this overhead is only a small factor of large 
computation (short rows normally do not face this problem). For very long rows that exceed 
the scratchpad memory capacity, our method still allocates a space in the scratchpad memory 
as a level-1 merge sequence, executes the same merge operations on it and merges the level-1 
sequence in the scratchpad memory and the resulting sequence in the global memory only once 
before the kernel is ready to return. 

It is worth noting that the parameters of the binning depends on specifications (e.g., thread 
bunch size and scratchpad memory capacity) of GPU architectures. In this paper, we use the 
abovementioned fixed-size parameters for assigning the rows into the bins since the current 
nVidia GPUs and AMD GPUs have comparable hardware specifications. However, the strategies 
in stages 2 and 3 can be easily extended for future GPUs with changed architecture designs. 

The fourth stage, arranging data, first sums the numbers of nonzero entries in all rows of 
the resulting matrix C and allocates its final memory space. Then our method copies existing 
nonzero entries from the temporary matrix C to the resulting matrix C. For the rows in the bin 
group 1, the copy operation is not required. For the rows in the bin group 2, we use one thread 
for each row. For the rest of the rows in the bin groups 3-5, we use one thread group for each 
row. After all copy operations, the SpGEMM computation is done. 

4.2. Evaluating GPU Merge algorithms 

Because both the binary search and the prefix-sum scan take fast logarithmic time for each 
entry in the input sequence, these operations have relatively good efficiency and performance 
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(b) 


(c) 



(d) 


(e) 


(f) 



(g) 



Sub-sequence size (log) 

(h) 



10 11 12 


10 11 12 


Sub-sequence size (log) 

(i) 


Figure 5: Performance compaiison of merging 32-bit keys, 32-bit key-32-bit value pairs and 32-bit key-64-bit value pairs 
through 5 GPU merge algorithms: ranking merge, merge path, basic oddeven merge, basic bitonic merge and advanced 
bitonic merge on three different GPUs: nVidia GeForce GTX Titan Black, nVidia GeForce GTX 980 and AMD Radeon 
R9 290X. 


stability on modern GPUs. Therefore, a fast merge algorithm is very crucial for the performance 
of the merge method in our SpGEMM framework. 

47 ! 3 [ 4 ^ have been proposed for 


Recently some new merge algorithms Il44145L [4 


GPUs. But which one is the fastest in practice is still an open question. Because the main objec¬ 
tive of the research | ^ , Slid] is efficiently merging large data in the global memory, they still 
use basic methods, such as bitonic sort and ranking-based merge, as building blocks for small 
data in the scratchpad memory. Peters et al. ll47n proposed a locality-oriented advanced bitonic 
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sort method that can reduce synchronization overhead by merging data in fast private memory 
instead of relatively slow shared memory. Therefore we evaluate 5 GPU merge algorithms; 
(1) ranking merge H, (2) merge path (3) basic oddeven merge ll^ . (4) basic bitonic 
merge 1461] . and (5) advanced bitonic merge 1471] . The implementation of the algorithm (2) is 
extracted from the Modern GPU library 15 ill . The implementations of the algorithm (3) and (4) 
are extracted from the nVidia CUBA SDK. We implement the algorithm (1) and (5). Addition¬ 
ally, another reason why we conduct the evaluation is that none of the above literature presented 
performance of merging short sequences of size less than 2'^, which is the most important length 
(consider the nnzr(C) in Table |2]l for our SpGEMM. 

Our evaluation results of merging 32-bit keys, 32-bit key-32-bit value pairs and 32-bit key- 
64-bit value pairs are shown in Figure |5] The experimental platforms are described in Table [1] 
Each of the five algorithms merges two short ordered sequences of size / into one ordered output 
sequence of size 21. The sorting network methods in our evaluation only execute the last stage, 
since both inputs are sorted. To saturate throughput of GPUs, the whole problem size is set to size 
2^^. For example, 2^^ thread groups are launched while each of them merges two sub-sequences 
of size I = 2'®. We execute each problem set through multiple thread groups of different sizes 
and record the best performance for the evaluation. 

In Figure |5] we can see that the GPU merge path algorithm almost always outperforms other 
methods while sub-sequence size is no less than 2*. Since our merge method starts from size 
256, the merge path method is chosen for our SpGEMM implementation. The extra advantages 
of the merge path method are that it can evenly assign work load to threads and can easily deal 
with the input sequences of arbitrary sizes. Detailed description and complexity analysis of the 
GPU merge path algorithm can be found in 14511 . 

Other algorithms are not chosen because of various reasons. We can see that the ranking 
merge is faster than the merge path method in Figure|5jf)- But it is not chosen in our implemen¬ 
tation, since this algorithm is an out-of-place method that requires more scratchpad memory and 
thus cannot scale to longer sequences. Because the basic bitonic merge and the basic oddeven 
merge in general do not show better performance and cannot simply deal with data of arbitrary 
sizes, none of them is chosen. The advanced bitonic sort method is always the slowest because 
it loads data from the scratchpad memory to thread private memory (register file or an off-chip 
memory space) for data locality. However, due to the small or negatively large latency gap be¬ 
tween the scratchpad memory and the thread private memory, the load operations actually reduce 
the overall performance. Thus this method should only be used for migrating global memory ac¬ 
cess to scratchpad memory access. 

We can also see that the AMD Radeon R9 290X GPU is almost always much faster than the 
two nVidia GPUs in all tests. The reason is that the capacity of the scratchpad memory (2816 
kB, 64 kB/core x 44 cores, in the AMD GPU, 1536 kB, 96 kB/core x 16 cores, in the nVidia 
Maxwell-based GTX 980 GPU and 720 kB, 48 kB/core x 15 cores, in the nVidia Kepler-based 
GTX Titan Black GPU) heavily influence the performance of merging small sequences. For the 
same reason, the GTX 980 GPU delivers better overall performance than the GTX Titan Black 
GPU. On the other hand, even though the AMD GPU has 64 kB scratchpad memory per core, 
each instance of the kernel program can only use up to 32 kB. Thus the AMD GPU cannot scale 
to longer sub-sequences (e.g., 2^^ with 32-bit key-32-bit value pairs) that can be executed by 
using the nVidia GPUs. 
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5. Experimental Results 

5.1. Testbeds 

We use four platforms (one CPU and three GPUs) shown in Table [1] for evaluating the 
SpGEMM algorithms. The host side of all GPUs is a quad-core 3.7GHz CPU in an AMD AlO- 
7850K APU with 8 GB DDR3-1600 dual-channel system memory and 64-bit Ubuntu Linux 
14.04. 


Table 1: One CPU and three GPUs used for benchmarking 


Vendor 

Intel 

nVidia 

nVidia 

AMD 

Family 

Xeon CPU 

GeForce GPU 

GeEorce GPU 

Radeon GPU 

Device 

E5-2630 

GTX Titan Black 

GTX 980 

R9 290X 

Codename 

Sandy Bridge 

Kepler GKI10 

Maxwell GM204 

GCN Hawaii 

#Cores 

6 

15 

16 

44 

#SIMD units 

6x256-bit wide 

2880 CUDA cores 

2048 CUDA cores 

2816 Radeon cores 

Clock 

2.3 GHz 

889 MHz 

1126 MHz 

1050 MHz 

SP flop/cycle 

96 

5760 

4096 

5632 

SP Peak 

220.8 GFlop/s 

5120.6 GFlop/s 

4612.1 GFlop/s 

5913.6 GFlop/s 

DP flop/cycle 

48 

1920 

128 

704 

DP Peak 

110.4 GFlop/s 

1706.9 GFlop/s 

144.1 GFlop/s 

739.2 GFlop/s 

On-chip scratchpad 

N/A 

720 kB 

1536 kB 

2816 kB 

Memory 

32 GB DDR3-1333 (4 
channels) 

6 GB GDDR5 

4 GB GDDR5 

4 GB GDDR5 

Bandwidth 

42.6 GB/s 

336 GB/s 

224 GB/s 

345.6 GB/s 

OS (64-bit) 

Ubuntu 12.04 

Ubuntu 14.04 

Ubuntu 14.04 

Ubuntu 14.04 

Device driver 

N/A 

V344.16 

V344.16 

V14.41 

Compiler 

Intel C-I-+ vl4.0 

g++ v4.9, 
nvcc v6.5.19 

g++ v4.9, 
nvcc v6.5.19 

g++ v4.9, 

OpenCL vl.2 

Library 

Intel MKL vll.O 

CUSPvO.4.0, 
cuSPARSE v6.5, 
RMerge, 
bhSPARSE 

CUSP vO.4.0, 
cuSPARSE v6.5, 
RMerge, 
bhSPARSE 

bhSPARSE 


5.2. Performance Comparison for Galerkin Products 

Calculating Galerkin products plays an important role in AMG. We use smoothed aggre¬ 
gation prec onditioner with Jacobi smoother (described in (|2l and implemented in the CUSP 
library lll4ll l as a test scenario for evaluating SpGEMM algorithms. In each level of an AMG hi¬ 
erarchy in this context, we multiply three sparse matrices P^, A and P, where rectangular matrix 
P^ is a restriction operator, square matrix A is initially the system matrix, and rectangular matrix 
P is a prolongation operator. 

Eigures and Q show execution time of Galerkin products P^AP in constructing an AMG 
hierarchy (typically including 3-5 levels) for a smoothed aggregation preconditioner in sin¬ 
gle precision and double precision, respectively. The input system matrix A is from 2D 5- 
point, 2D 9-point, 3D 7-point or 3D 27-point Poisson problem, respectively. The two 2D prob¬ 
lems have dimensions 1024 x 1024 and generate system matrices of size 1048576 x 1048576. 
The two 3D problems have dimensions 101 x 101 x 101 and generate system matrices of size 
1030301 X 1030301. The SpGEMM approaches in three libraries, CUSP vO.4.0, cuSPARSE 
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v6.5 and bhSPARSlfl are tested on nVidia GeForce GTX Titan Black and GeForce GTX 980 
GPUs. To obtain the best SpGEMM performance, CUSP uses the coordinate (COO) format for 
its input matrices. The other two libraries use the CSR format. Because the operation multiplies 
three sparse matrices , A and P, the order of multiplication may influence overall performance. 
Here we test the two possible orders {P^A)P and P^iAP). In our experiments, matrix data trans¬ 
fer time between the host and the device is not included since the SpGEMM is normally one of 
the building blocks for more complex problem completely running on GPUs. 


□ CUSP HcuSPARSE IbhSPARSE GCUSP HcuSPARSE IbhSPARSE DCUSP HcuSPARSE IbhSPARSE DCUSP HcuSPARSE IbhSPARSE 

250.00 300.00 600.00 - 1200.00 - 



(a) 2D 5-point (b) 2D 9-point (c) 3D 7-point (d) 3D 27-point 


Figure 6: Execution time (in milliseconds) comparison of single precision SpGEMM (SpSGEMM) from three libraries 
CUSP, cuSPARSE and bhSPARSE in the context of smoothed aggregation preconditioner with Jacobi smoother. The 
system matrices are from four Poisson problems. Both (P^A)P and (AP) are tested on two nVidia GPUs. 


□ CUSP ■ cuSPARSE ■ bhSPARSE □ CUSP ■ cuSPARSE ■ bhSPARSE □ CUSP ■ cuSPARSE ■ bhSPARSE □ CUSP ■ cuSPARSE ■ bhSPARSE 

250.00 300.00 600.00 1200.00 



(a) 2D 5-point (b) 2D 9-point (c) 3D 7-point (d) 3D 27-point 


Figure 7: Execution time (in milliseconds) comparison of double precision SpGEMM (SpDGEMM) from three libraries 
CUSP, cuSPARSE and bhSPARSE in the context of smoothed aggregation preconditioner with Jacobi smoother. The 
system matrices are from four Poisson problems. Both (P^A)P and P^(AP) are tested on two nVidia GPUs. 


In Eigures |6] and |7] we can see that our method is constantly faster than SpGEMM algo¬ 
rithms in the other two libraries. When using system matrix from 3D 27-point Poisson problem, 
bhSPARSE delivers up to 2.6x and up to 2.7x speedups over cuSPARSE and CUSP, respectively. 
On average, speedups of 1.9x and 1.7x are achieved when compared with the above two libraries, 
respectively. 

As for the order of multiplication, we can see that our method in general gives better per¬ 
formance while doing P^(AP), compared to running (P^A)P. In contrast, the order of multi¬ 
plication does not bring obvious performance difference for CUSP. When cuSPARSE is used. 


^We call our libraiy bhSPARSE since this work is under the Project Bohrium f^ . 
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(P^A)P delivers better throughput for the two 2D problems, but degrades throughput for the two 
3D problems. 


5.3. Benchmark Suite for Evaluating Matrix Squaring 

We also evaluate multiplication of sparse square matrix and itself (i.e., C = A^) to avoid 
introducing another sparse matrix as a multiplier with different sparsity structure. We choose 
23 sparse matrices as our benchmark suite. 16 of them were widely used for performance eval¬ 
uations in previous sparse matrix computation research S SIB [Hill OS El El. The 
other 7 new matrices are chosen since they bring more diverse irregular sparsity structures that 
challenge the SpGEMM algorithm design. The variety of sparsity structures are from many 
application fields, such as hnite element methods, macroeconomic model, protein data, circuit 
simulation, web connectivity and combinational problem. All of the 23 matrices are download¬ 
able from the University of Florida Sparse Matrix Collection ifBl . Note that symmetry in the 
sparse matrices is not used in our SpGEMM algorithm, although some matrices in the bench¬ 
mark suite are symmetric. Also note that we use the standard CSR format that does not consider 
symmetric storage pattern. 

Besides the input matrix A, the work complexities of the different SpGEMM algorithms also 
depend on the intermediate matrix C and the resulting matrix C. So we list characteristics of the 
three matrices in Table |2] The set of characteristics includes matrix dimension (n), the number 
of nonzero entries (nnz) and the average number of nonzero entries in rows (nnzr). The upper 
9 matrices in the table have relatively regular nonzero entry distribution mostly on the diagonal. 
The other 14 matrices include various irregular sparsity structures. 


5.4. Performance Comparison for Matrix Squaring 

The single precision and double precision absolute performance of the SpGEMM algorithms 
that compute C = A^ are shown in Figures [8] andrespectively. Four GPU methods from CUSP 
vO.4.0, cuSPARSE v6.5, RMerge lllnl and bhSPARSE are evaluated on three GPUs; nVidia 
GeForce GTX Titan Black, nVidia GeForce GTX 980 and AMD Radeon R9 290X. One CPU 
method in Intel MKL vl 1.0 is evaluated on Intel Xeon E5-2630 CPU. The performance of an¬ 
other recent ESC-based GPU SpGEMM work OlSll is not included in the comparison because 
its source code is not available to us yet. The Intel MKL SpGEMM program is multithreaded 
and utilizes all six cores in the Intel Xeon CPU. For GPU algorithms, again, the host-device data 
transfer time is not included. 

We hrst compare the performance of the four different GPU SpGEMM algorithms on the 
nVidia GPUs. We can see that bhSPARSE always outperforms CUSP, cuSPARSE and RMerge 
on most sparse matrices in the benchmark suite. Compared to the two vendor supplied libraries, 
our method obtains better SpSGEMM and SpDGEMM performance on 21 and 21 matrices out 
of the whole 23 matrices over CUSP, and on 19 and 21 matrices over cuSPARSE, respectively. 
Compared to RMerge, another CUDA-specihc method, bhSPARSE achieves better SpSGEMM 
and SpDGEMM performance on 19 and 10 matrices on the GTX Titan Black GPU, and on 19 
and 20 matrices on the GTX 980 GPU. 

From the perspective of speedup, our method delivers on average 4.6x (up to 9.6x) and 3.lx 
(up to 8.8x) speedup on SpSGEMM performance over CUSP and cuSPARSE, and on average 
4.6x (up to 9.9x) and 3.lx (up to 9.5x) speedup on SpDGEMM performance over them, respec¬ 
tively. Compared to RMerge, our method offers on average 1.4x (up to 2.5x) speedup and 2.8x 
(up to 4.9x) speedup for SpSGEMM and on average l.Ox (up to 1.5x) and 2.lx (up to 3.4x) 
speedup for SpDGEMM on the GTX Titan Black GPU and GTX 980 GPU, respectively. 
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Table 2: Overview of sparse matrices for benchmarking matrix squaring. Here nnz(C) is the upper hound size of 
Numerically, nnz(C) equals to half of flops, the number of necessary aiithmetic operations while doing SpGEMM. 
nnz{C) is the number of nonzero entries in the resulting matrix C = A^. 


Name 

Plot 

n 

nnz(A), nnzr{A) 

nnz(C), nnzriC) 

nnz(C), nnzr(C) 

FEM/Cantilever 

\ 

63 K 

4M, 64 

269.5 M,4315 

17.4 M, 279 

Economics 


207 K 

1.3 M, 6 

7.6 M, 37 

6.7 M, 32 

Epidemiology 


526 K 

2.1 M,4 

8.4 M, 16 

5.2 M, 10 

Filter3D 

\ 

106 K 

2.7 M, 25 

86 M, 808 

20.2 M, 189 

Wind Tunnel 


218K 

11.6M, 53 

626.1 M, 2873 

32.8 M, 150 

FEM/Ship 


141 K 

7.8 M, 55 

450.6 M,3199 

24.1 M, 171 

FEM/Harboi- 

\ 

\ 

47 K 

2.4 M,51 

156.5 M, 3341 

7.9 M, 169 

Protein 


36 K 

4.3 M, 119 

555.3 M, 15249 

19.6 M, 538 

FEM/Spheres 


83 K 

6M, 72 

463.8 M, 5566 

26.5 M, 318 


2cubes_sphere 


102 K 

1.6 M, 16 

27.5 M, 270 

9M, 88 

1 

FEM/Accelerator 

1 riM 

121 K 

2.6 M, 22 

79.9 M, 659 

18.7 M, 154 

Cage 12 


130 K 

2M, 16 

34.6 M, 266 

15.2 M, 117 

Hood 


221 K 

10.8 M, 49 

562 M, 2548 

34.2 M, 155 

M133-b3 


200 K 

0.8 M, 4 

3.2 M, 16 

3.2 M, 16 

Majorbasis 

\ 

160 K 

1.8 M, 11 

19.2 M, 120 

8.2 M, 52 

Mario002 


390 K 

2.1 M, 5 

12.8 M, 33 

6.4 M, 17 

Mono_500Hz 

r’Msi 

169 K 

5M, 30 

204 M, 1204 

41.4 M, 244 

Offshore 

!% 

260 K 

4.2 M, 16 

71.3 M, 275 

23.4 M, 90 

Patents_main 


241 K 

0.6 M, 2 

2.6 M, 11 

2.3 M, 9 

Poisson3Da ^ 


14 K 

0.4 M, 26 

11.8 M, 871 

3M, 219 

QCD 


49 K 

1.9 M, 39 

74.8 M, 1521 

10.9 M, 222 

Circuit 


171 K 

1 M, 6 

8.7 M,51 

5.2 M,31 

Webbase ? 

- 

Ik 

1 M 

3.1 M, 3 

69.5 M, 70 

51.1 M, 51 


We can see that the cuSPARSE method outperforms our approach when and only when the 
input matrices are fairly regular (belong to the first 9 matrices in Table |2|i. For all irregular ma¬ 
trices and some regular ones, our bhSPARSE is always more efficient. On the other hand, the 
absolute performance of the CUSP method is very stable since its execution time almost only de¬ 
pends on the number of necessary arithmetic operations. Therefore this approach is insensitive 
to sparsity structures. Actually this insensitivity may bring better performance on matrices with 
some specific sparsity structures. However in most cases, the CUSP method suffers with higher 
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Figure 8: Single precision SpGEMM (SpSGEMM) GFlop/s comparison of 5 methods/libraries (MKL, CUSP, cuS¬ 
PARSE, RMerge and bhSPARSE) on 4 platforms (Intel Xeon E5-2630, nVidia GeForce GTX Titan Black, nVidia 
GeForce GTX 980 and AMD Radeon R9 290X). The performance of bhSPARSE is shown by the points on the lines. 
The bars plot the throughout of the other tested approaches. 


global memory pressure. The RMerge method offers significant speedups over the other meth¬ 
ods on three matrices (i.e., Epidemiology, M133-b3 and Mario002), which are characterized by 
short rows. However, for the other matrices, RMerge supplies relatively lower performance due 
to imbalanced workload and high-overhead global memory operations between iterative steps. 
Further, we can see that since RMerge mainly relies on computational power of the SIMD units, 
its performance decreases from GTX Titan Black (2880 CUDA cores running at 889 MHz) to 
GTX 980 (2048 CUDA cores running at 1126 MHz). In contrast, our method also depends on 
capacity of scratchpad memory. Thus we can see that bhSPARSE obtains better performance 
while using GTX 980 (1536 kB scratchpad) over GTX Titan Black (720 kB scratchpad). 

Compared to Intel MKL on the Intel CPU, our CUDA-based implementation on the nVidia 
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Figure 9: Double precision SpGEMM (SpDGEMM) GFlop/s comparison of 5 methods/libraries (MKL, CUSP, cuS¬ 
PARSE, RMerge and bhSPARSE) on 4 platforms (Intel Xeon E5-2630, nVidia GeForce GTX Titan Black, nVidia 
GeForce GTX 980 and AMD Radeon R9 290X). The performance of bhSPARSE is shown by the points on the lines. 
The bars plot the throughout of the other tested approaches. 


GPUs obtains better SpSGEMM and SpDGEMM performance on all 23 matrices, and delivers 
on average 2.5x (up to 5.2x) and 2.2x (up to 4.9x) SpSGEMM and SpDGEMM speedup, respec¬ 
tively. Our OpenCL-based implementation on the AMD GPU in the machine 2 obtains better 
SpSGEMM and SpDGEMM performance on 23 and 18 matrices, and delivers on average 2.3x 
(up to 4.2x) and 1.9x (up to 3.8x) SpSGEMM and SpDGEMM speedup, respectively. 

The relative performance (harmonic mean) of the SpGEMM algorithms that compute C - 
is shown in Eigure[T0] We can see that our method in general delivers the best performance on 
the used testbeds while running the 23 matrices as a benchmark suite. If we set the Intel MKL 
SpGEMM performance in this scenario as a baseline, our approach is the only GPU SpGEMM 
that constantly outperforms well optimized CPU method. 
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(a) Single precision SpGEMM (b) Double precision SpGEMM 


Figure 10: Average (harmonic mean) relative performance comparison of the 23 matrices, using SpGEMM method in 
MKL on Intel Xeon E5-2630 as a baseline. 


5.5. Memory Pre-allocation Comparison 

Figure [TT] shows the comparison of the three memory pre-allocation methods, while bench¬ 
marking C = A^. We can see that, for small matrices (e.g., 2cubessphere), our hybrid method 
shows exactly the same space requirements as the upper bound method does. However, for large 
matrices, allocated memory sizes through our hybrid method are much closer to the memory sizes 
allocated by the precise method. Taking the matrix Protein as an example, our hybrid method 
requires 2.7x memory space over the precise method, while the upper bound method needs 20.6x 
space requirement. One exception is the matrix Webbase, our hybrid method actually allocates 
more memory space than the upper bound method. The reasons are that the reduced rate of the 
intermediate matrix C to the resulting matrix C is very low (see Table |2]l and our 2x progression 
mechanism just allocates memory across the upper bound size. But overall, our hybrid method 
saves space allocation of the upper bound method and execution time of the precise method 
without introducing any significant extra space requirements. 

5.6. Using Re-allocatable Memory 

For some matrices with relatively long rows in the bin group 5, our method dumps scratchpad 
data to global memory, allocates a larger memory block, copies the old data to the newly allocated 
portion, reloads values and continues processing. We have to do the allocation/copy operation 
pair and pay the overhead since current GPUs are not able to re-allocate memory (i.e., change the 
size of the memory block pointed to a certain pointer). However, the emerging heterogeneous 
processors with shared virtual memory (or unified memory) address space deliver a possibility 
that lets integrated GPUs use system memory, which is re-allocatable from the CPU side. 

We evaluated two memory allocation strategies (i.e., a typical allocation/copy approach and 
an improved re-allocation approach) of our OpenCL-based SpGEMM algorithm on the GPU part 
(8 GCN core, 512 Radeon cores running at 1028 MHz, 1052.7 GFlop/s SP peak, 65.8 GFlop/s DP 
peak) in the AMD A10-7850K APU. Figure [TSlshows the results. We can see that re-allocatable 
memory brings on average 1.2x (up to 1.6x) speedup and on average 1.2x (up to 1.8x) speedup for 
SpSGEMM and SpDGEMM, respectively. Therefore, our GPU SpGEMM method may deliver 
further performance improvement on future GPUs with re-allocatable memory, or on emerging 
heterogeneous processors composed of CPU cores and GPU cores. Moreover, both CPU cores 
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Figure 11: Global memory requirement comparison of the precise method, our hybrid method and the upper bound 
method, when benchmarking C = on the 23 matrices. The memory requirement of the precise method includes the 
two input matrices and the resulting matrix. The memory requirements of the other two methods also contain additional 
intermediate matrices. “Hmean” refers to harmonic mean. 


and GPU cores can be utilized for Stage 3 in our framework. We leave this heterogenous work¬ 
load partitioning (similar to the methods described in |55, 5^) to future work. 


6. Conclusion 

In this paper we demonstrated an efficient SpGEMM framework and corresponding algo¬ 
rithms on GPUs and emerging CPU-GPU heterogeneous processors for solving the three chal¬ 
lenging problems in the SpGEMM. In the two experimental scenarios using matrices with diverse 
sparsity structures as input, our SpGEMM algorithm delivered excellent absolute and relative 
performance as well as space efficiency over the previous GPU SpGEMM methods. Moreover, 
on average, our approach obtained around twice the performance of the start-of-the-art CPU 
SpGEMM method. Eurther, we showed that our method obtained higher performance on emerg¬ 
ing heterogeneous processors with re-allocatable memory. 
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Figure 12: C — performance of bhSPARSE running with and without re-allocatable memory on an AMD AlO- 
7850K APU. Note that only executable matrices that require memory re-allocation are included here. “Hmean” refers to 
harmonic mean. 
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